CMSC320: Introduction to Data Science
26 March 2021
Code by Tyler A. Clark

Project 2: Money Ball

Part 1: Wrangling

Problem 1:

Using SQL compute a relation containing the total payroll and winning percentage ($\frac{wins}{games} * 100$) for each team (that is, for each teamID and yearID combination). You should include other columns that will help when performing EDA later on (e.g., franchise ids, number of wins, number of games).

First we'll import the libraries we need and connect to lahman2014 database

Now we will run a query that returns a dataframe with data for each team in each season including variables for the number of wins, losses, games played, win percentage, and total payroll. We opt to use a INNER JOIN when querying the database, because while teams' wins and losses have been thoroughly recorded since the early 20th century, payroll information has not been consistently recorded until the last forty years. Using INNER JOIN we only return teams' seasons with both salary and win-loss data available.

For this project we can see that there is readily available payroll and win-loss data from 1985 to 2014.

For fun visualization we will add two columns with the hexadecimal color codes for the primary and secondary colors for each team, we gathered the hex codes for each team from https://github.com/jimniels/teamcolors. Note that some of the primary and secondary colors have been swapped for clarity in visualization (i.e. The primary color of the New York Yankees in this JSON file is red, it has been swapped with secondary color, navy, the more traditional color associated with the Yankees). Additionally, note that the colors correspond to the Franchise ID and not the Team ID, so teams like the Montreal Expos and Florida Marlins will have the modern colors of their current franchise: The Washington Nationals and Miami Marlins, respectively.

Contents of teams.json and code2team.json can be found in the Appendix A and Appendix B respectively.

Part 2: Explatory Data Analysis

Problem 2:

Write code to produce plots that illustrate the distribution of payrolls across teams conditioned on time (from 1990-2014).

We use the matplotlib library to plot each teams' total payroll for each season from 1985-2014. Note we included five extra years of data (1985-1989), because the data is available. For fun, we plot each team as their primary color.

Question 1:

What statements can you make about the distribution of payrolls conditioned on time based on these plots? Remember you can make statements in terms of central tendency, spread, etc.

The plot above shows each MLB team's total payroll per season over time. The fact that there are 35 teams that played at different times over the 30 year span makes this lineplot noisy and difficult to interpret. We will investigate how efficiently teams spent their payroll further throughout the project, but there are a few key takeaways:

The minimum payroll stays roughly the same over ther 30 year period; however, the maximum payroll increases over the same period. Therefore the median, mean, and variance, must also increase over the 30 year period.

Before proceeding to the next problem, let's replot the graph above in different colors, to see how the Oakland A's total payroll compared to the league each season. This time the Oakland A's lineplot will be their secondary color (Yellow), and all other teams' lineplots will be grey.

The plot above shows each MLB team's total payroll per season over time. The yellow lineplot represents the Oakland A's. From 1985-1989, Oakland seemed to spend around the league average. From 1990-1995, The A's became one of the highest spenders in the MLB relative to the other teams in those seasons. In 1996 the team was acquired by Stephen Schott and Ken Hofmann after the untimely death of the previous owner. The new owners demanded management slash the team's payroll, and since 1996 Oakalnd has had one of the smallest payrolls relative to the other teams in the league every season.

Problem 3:

Write code to produce plots that specifically show at least one of the statements you made in Question 1. For example, if you make a statement that there is a trend for payrolls to decrease over time, make a plot of a statistic for central tendency (e.g., mean payroll) vs. time to show that specficially.

We made statements above about the trends of the minimum payroll (stagnant), and the maximum payroll (increasing), and therefore concluded that the mean, median, and variance should also increase. Note instead of variance, we will use standard deviation to keep the salary unit on the y-axis consistent. To check this, we calculate the minimum, maximum, mean, median, and standard deviation of the MLB's total payroll for each season and plot it as a time series.

The plot above shows the mean total payroll, median total payroll, standard deviation of the total payroll, maximum total payroll, and minimum total payroll over each season from 1985 to 2014.

It is obvious from the plot above that the maximum payroll increases steadily from 1985-2005, and continues to increase although more slowly from 2005-2014. It is less obvious, and worth replotting to prove the point, but the mean, median, and standard deviation all also increase. Contrary to my initial assertion, the minimum does seam to increase as well. Whether one believes the minimum total payroll increases over time largely depends on whether the viewer interprets the 2013 and 2014 minimum salaries as outliers that buck the trend, or a regression to the minimum viable salary.

Below we replot the data above, this time exluding the maximum total payroll data to more clearly observe trends in the other descriptive statistics.

This is a recreation of the same plot above without the maximum total salary lineplot distorting the plot, allowing us to see the other descriptive statistics increase over time. We can clearly see the mean, median, standard deviation, and, to some extent, the minimum total payroll increase from 1985-2014.

Problem 4:

Write code to discretize year into five time periods and then make a scatterplot showing mean winning percentage vs. mean payroll for each of the five time periods. You could add a regression line in each scatter plot to ease interpretation.

We divide the time periods into five buckets of equal width (6 years each) and label the buckets according to their time span. We then create a scatterplot for each time period plotting mean winning percentage vs mean total payroll for each of the six year periods. We add a regression line to each scatterplot to see general correlation between the variables.

Question 2:

What can you say about team payrolls across these periods? Are there any teams that standout as being particularly good at paying for wins across these time periods? What can you say about the Oakland A's spending efficiency across these time periods (labeling points in the scatterplot can help interpretation).

The plots above represent five equal length time periods of six years each from 1985-2014: 1985-1990, 1991-1996, 1997-2002, 2003-2008, and 2009-2014. Each figure plots the mean total payroll for each team over that time period on the x-axis, and the teams win percentage over that time period on the y-axis. Each plot has a linear regression plotted over the data. By simply looking at the linear regression across each period, we see generally that the more money a team spends, the more likely a team is to have a higher win percentage. Of course there are notable outliers:

From 1985-1990, the New York Mets and Toronto Blue Jays appear to pay for wins the most efficiently, far exceedingly the linear regression. From 1991-1996, the Atlanta Braves and Montreal Expos appear to be spend the most efficiently. From 1997-2002, the most efficient team appears to be the Oakland A's. From 2003-2008, the most efficient team appears to be the Minnesota Twins. Finally, from 2009-2014, the most efficient team appears to be the Tampa Bay Rays.

The Oakland A's appear to be efficient spenders during each six year period, with the exception of 1991-1996, the only period where their win percentage falls below the linear regression.

Part 3: Data Transformation

Standardizing across years

It looks like comparing payrolls across years is problematic so let's do a transformation that will help with these comparisons.

Problem 5:

Create a new variable in your dataset that standardizes payroll conditioned on year. So, this column for team i in year j should equal: $standardized\_payroll_{ij} = \frac{{payroll}_{ij} - \overline{payroll}_{j} }{{s}_{j}}$ for team i in year j. where $\overline{payroll}_{j}$ is the average payroll for year j, and $s_j$ is the standard deviation of payroll for year j.

We create a new variable called standard_payroll that standardizes payroll conditioned on year, using the equation described above. We utilize the pandas groupby method to quickly calculate the mean and standard deviation of payrolls for a given year.

From the top rows of the dataframe above we can see the 2001 Seattle Mariners and 1995 Cleveland spent close to the league average in their respective seasons and still had some of the highest win percentages over the 30 year period.

Problem 6:

Repeat the same plots as Problem 4, but use this new standardized payroll variable.

We create a scatterplot for each time period plotting mean winning percentage vs mean standardized payroll for each of the six year buckets. We add a regression line to each scatterplot to see general correlation.

The plots above represent five equal length time periods of six years each from 1985-2014: 1985-1990, 1991-1996, 1997-2002, 2003-2008, and 2009-2014. Each plot has the mean standardized payroll for each team over that time period plotted on the x-axis, and the teams win percentage over that time period plotted on the y-axis. Each plot has a linear regression plotted over the data. By simply looking at the linear regression across each period, we see generally that the more money a team spends, the more likely a team is to have a higher win percentage.

Perhaps surprisingly, our conclusions regarding who spent the most efficiently for each bin of years does not change with the transformation applied to the payroll variable. We explore the transformation further below.

Question 3:

Discuss how the plots from Problem 4 and Problem 6 reflect the transformation you did on the payroll variable.

In order to discuss the visual tranformation, let's discuss the mathematical transformation:

The numerator: ${payroll}_{ij} - \overline{payroll}_{j}$, measures how far a team's total payroll is from the mean total payroll that year. If a team is spending more than the league average, the numerator will be positive. Conversely, if a team is spending less than the league average, then the numerator will be negative. And if a team is spending approximately the league average, then the numerator will be approximately zero. However, the range of salaries differs drastically between years, so in order to "standardize" and compare payrolls from different years, we divide by the denominator which is the standard deviation of total payrolls in a given year. That way we can measure, in a standardized way, how much more or less a team is spending than what we would expect them to in a given season.

Whereas in the Problem 4 plot we saw a teams mean payroll on the x-axis, in Problem 6 we see how many standard deviations a team's total payroll is from the average total payroll in a given year. Previously our x-axis had values in the tens to hundreds of millions, but now standardly measures on a scale of roughly -3 to 4, how much more or less a team spent than the rest of the league, specifically how many standard deviations of payroll more or less a team spent than the rest of the league.

This transformation allows us to meaningfully compare teams' total payroll across seasons.

Expected wins

It's hard to see global trends across time periods using these multiple plots, but now that we have standardized payrolls across time, we can look at a single plot showing correlation between winning percentage and payroll across time.

Problem 7:

Make a single scatter plot of winning percentage (y-axis) vs. standardized payroll (x-axis). Add a regression line to highlight the relationship.

The regression line gives you expected winning percentage as a function of standardized payroll. Looking at the regression line, it looks like teams that spend roughly the average payroll in a given year will win 50% of their games (i.e. win_pct is 50 when standardized_payroll is 0), and teams increase 5% wins for every 2 standard units of payroll (i.e., win_pct is 55 when standardized_payroll is 2). We will see how this is done in general using linear regression later in the course.

From these observations we can calculate the expected win percentage for team i in year j as

${expected\_win\_pct_{ij} = 50+2.5×standardized\_payroll_{ij}}$

We create a single scatter plot of winning percentage (y-axis) vs. standardized payroll (x-axis). Now that we have a standardized payroll variable that measures how much more or less a team spent compared to the rest of the league in a given season, we can compare teams' spending across seasons. Therefore, we can plot one point for each team from every season in one figure. We add a regression line to observe trends in the plot.

Above we have a plot of every team from 1985 to 2014 with their win percentage plotted as a function of their standardized salary. Remember, we standardized salary in order to compare salaries and win percentages between teams across seasons. Standardized salary is roughly a measure of how much more a team spent than the league average that season. This plot is not useful for discerning individual team performances, but rather the examining the general trend of the league. We note that there is a general positive correlation, i.e. teams that spend more relative to the league average win more. However, we can't conclude that they are causally related i.e. there is no indication that spending more leads to more wins. The correlation can be described by the linear regression equation below:

The equation above tells us that a team should expect to have a winning percentage that is 2.53 times the number of standard payroll units plus approximately 50. For example if your team spends four standard deviations more than the league's mean total payroll that season, then you should expect to win $2.53*4+49.98 = 60.1\%$ of your games.

Similarly, if you your team spends exactly the league average total payroll (i.e. 0 standard payroll units) then you should expect to win just shy of 50% of your games. It feels fairly intuitive that if your payroll is the league average then you should win about half of your games. This intuition of course relies on the belief of a causal relationship between payroll and win percentage, which may or may not be true.

As an aside: To think slightly more critically about the potential of a causal relationship, let's look at how strongly correlated win percentage and standardized payroll actually are. The correlation coefficient of the data in scatterplot above is calculated below:

The correlation coefficent above implies that data is far from perfectly correlated, but there is a clear positive correlation that exists.

A further aside:

For a moment let's assume that the variables above are causally related, then what could be another factor or variable that could add noise to this relationship?

The purpose of the Moneyball/Sabremetrics revolution in baseball was to identify which metrics and player statistics translated into wins. It is not uncommon in sports, even in 2021, for teams to spend exorbinant amount of money on players who previously performance indicates that they would add wins, but then due to performance regression, injury, or other factors the player's past performance does not translate into future performance, and teams get a bad deal. Reliance of previous performance to indicate future performance will always be imperfect, and may be one reason why teams who attempt to spend aggressively and efficiently, don't always see that spending translate to wins.

We will now use the linear regression to calculate the expected win percentage of each team.

As mentioned in problem 5, the 2001 Seattle Mariners and 1995 Cleveland had some of the highest win percentages over the 30 year period we have data for, despite spending close to the league average in their respective seasons. We can now see how significantly their win percentages differed from the expected win percentage calculated from their standardized payroll. We will quantify this with the efficiency variable defined below.

Spending efficiency

Using this result, we can now create a single plot that makes it easier to compare teams efficiency. The idea is to create a new measurement unit for each team based on their winning percentage and their expected winning percentage that we can plot across time summarizing how efficient each team is in their spending.

Problem 8:

Create a new field to compute each team's spending effiency, given by

$efficiency_{ij} = win\_pct_{ij} − expected\_win\_pct_{ij}$

for team i in year j, where expected_win_pct is given above.

Make a line plot with year on the x-axis and efficiency on the y-axis. A good set of teams to plot are Oakland, the New York Yankees, Boston, Atlanta and Tampa Bay (teamIDs OAK, BOS, NYA, ATL, TBA).

As Problem 8 asks, we define a new variable called efficiency which is equal to win_pct-expected_win_pct.

Now, we see some interesting results! The most efficiently spent single-season payroll over the 30 year span was 20.66% and belongs to the 2001 Seattle Mariners.

One team that had previously not appeared in the printed portion of the dataframe was the 1994 Montreal Expos. Despite spending approximately 1.65 less standard payroll units than the league's mean payroll in 1994, the Expos won nearly 65% of their games earning them the second highest payroll efficiency of any team in any season between 1985-2014, 19.09%. Some diehard fans and analysts argue that if the 1994 Postseason not been cancelled due to a player's strike, then the Expos would have won the World Series, and therefore would not have relocated to Washington D.C. as the Nationals in 2004.

Because this is a project about Moneyball, we should dicuss the Oakland Athletics. One could argue A's were the most consistently efficient franchise over the 30 year span with six (6) seasons with an efficiency of at least 11%. No other franchise had more than three such seasons during the 30 year span.

Now let's plot the A's efficiency over that period compared some other major league teams:

Question 4:

What can you learn from this plot compared to the set of plots you looked at in Question 2 and 3? How good was Oakland's efficiency during the Moneyball period (2000-2005)?

The plot above shows the "Spending Efficiency" of five MLB teams (the Oakland Athletics, New York Yankees, Boston Red Sox, Atlanta Braves and Tampa Bay Rays) for each season from 1985 to 2014. We define spending efficiency as the difference between a team's actual win percentage and the team's expected win percentage in a given year. Of course expected win percentage, as defined above, is a linear function of a team's standardized payroll, a standard measure of how much more or less a team spent than the league as a whole that season.

The plot can be interpretted in two ways, one can see efficiency as literally the difference between a team's actual win percentage and a team's expected win percentage (as a function of how much they spent). Efficiency in turn can be interpreted as a measure of how efficiently a team spent their payroll. If a team's actual win percentage far exceeded their expected win percentage (again as a function of how much the team spent on payroll), then the team spent their payroll efficiently. If a team's actual win percentage did not meet their expected win percentage, then the team spen their payroll inefficiently. Of course this second interpretation of the plot is predicated on the notion that payroll and win percentage are causally related and that a higher payroll can buy wins.

In the set of plots created for Question 2 and 3, we could identify trends in league spending. We learned from those plots that generally, that regardless of time period (1985-2014), if a team spent more money relative to the league average, then they were more likely to have a higher winning percentage. As we discussed in Question 3, while there is a positive correlation, the correlation coefficient does not indicate a strong positive correlation. So, while spending more money usually translates to a higher winning percentage, that is not always the case. The plot above examines how far teams deviated from how many wins they should expect given how much they spent relative to league spending that year.

Depending on how you measure it--ownership, wins, playoff appearances, or Billy Beane's role in the A's organization--the exact years covering of the Moneyball era are contested. We can see from the plot above that the A's had a positive efficiency from 1999-2006, roughly coinciding with Beane's ascent into the General Manager role, and acquisition of the A's by Stephen Schott and Ken Hofmann. During that period, the A's recorded some of the highest efficiency of any team between 1985 and 2014. For the purpose of this project, we use the defition of the Moneyball era provided in the project handout: 2000-2005.

We replot the efficiency graph above for 2000-2005 those years. For this plot; however, we will plot every team that played from 2000-2005.

The plot above shows the "Spending Efficiency" of every MLB teams for each season from 2000 to 2005. The Oakland A's efficiency is plotted in their secondary color, Yellow.

In 2000, the A's were the 3rd more efficient team with respect to spending; they were second in 2001, first in 2002, first in 2003, fourth in 2004, and fifth in 2005.

Specifically, the 2002 A's, subject of the movie Moneyball, and the 2001 A's had the 5th (16.41%) and 6th (16.21%) greatest payroll efficiencies of any team during any season between 1985-2014, respectively. The only teams with higher efficiencies were the 2001 Seattle Mariner (20.66%), the 1994 Montreal Expos (19.09%), the 1995 Cleveland Indians (18.40%), and the 1998 New York Yankees (16.41%). In fact, between 1985 and 2014, the Oakland A's were the only franchise to have an efficiency of at least 10% in 7 or more seasons, and the only team to have a payroll efficiency of at least 15% in 3 or more seasons.

We can conclude from our analysis that the 2000-2005 Oakland A's were successful in efficiently spending their total payroll to construct a competitive team relative to the rest of the teams in the MLB during that time period.